Exploring unsupervised machine learning methods to extract insights from medical narratives about falls among older adults (age 65+)
Author
Ifechukwu Mbamali
Published
29-11-2023
Introduction
The analysis used embeddings, dimensionality reduction, clustering algorithms, network graphs, and text summarization techniques to identify and understand themes in medical narratives about falls among older adults.
Key findings: Combining embeddings with dimensionality reduction techniques proved effective in extracting cluster themes. DBSCAN [1] outperformed k-means in cluster identification. Patients in the “Alcohol-Related Head Injuries and Falls” group tended to be younger, the “Atrial fibrillation related falls” group was generally older, and the “Syncope-Related Head Injuries” group had a higher rate of severe cases than the other groups. Compared with the previous year (2021), cases involving “Head Injuries from Falls”, “Syncope-Related Head Injuries” and “Rib Injuries from Falls” saw the largest increases in the average number of cases.
[1] DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm, introduced in Ester et al. 1996, which can be used to identify clusters of any shape in a data set containing noise and outliers.
Ultimately, insights gained through this analysis can help inform policies and interventions to reduce falls among older adults. The competition was hosted by the Centers for Disease Control and Prevention.
Data Overview
The analysis made use of 2 data-sets:
Primary data-set
OpenAI embeddings data-set
Loading Data
Code
# Recode the encoded variables in the dataset to human-readable values
mapping <- fromJSON("data/variable_mapping.json")

# Convert to data frames so we can use them in joins
mapping_tables <- list()
for (col in names(mapping)) {
  mapping_tables[[col]] <- data.frame(
    ind = as.integer(names(mapping[[col]])),  # change to integer type
    values = unlist(mapping[[col]])
  )
}
Code
# Load primary data
pdf <- read.csv("data/primary_data.csv")

# Join and replace each encoded column with its human-readable values
for (col in names(mapping)) {
  pdf <- pdf %>%
    left_join(mapping_tables[[col]], by = setNames("ind", col)) %>%
    mutate(!!col := values) %>%
    select(-values)
}
# Table view of the first 51 columns of the raw embeddings file
as.datatable(
  formattable(emb2c |> head(5)),
  rownames = F,
  filter = 'top',
  options = list(
    pageLength = 10,
    autoWidth = F,
    order = list(list(2, 'desc'))  # asc
  ),
  class = 'bootstrap'
)
Text cleaning and pre-processing
Some of the general preprocessing steps include:
New fields: Two columns were added to the primary data-set: one that derives a severity level from the disposition column, and one that categorizes the activity described in each narrative.
Replacing Medical Abbreviations: Abbreviations in the narrative column were replaced with their full clinical definitions to improve readability.
Note
Creating new columns: Severity Level and Activity
Code
# Create "severity_level": "severe" if "disposition" contains 4 or 5, "not severe" otherwise
pdf <- pdf |>
  mutate(severity_level = ifelse(grepl("4|5", disposition), "severe", "not severe"))

# Create "activity": if "product_1" contains "ACTIVITY", capture the text between
# "-" and "(" or between "-" and ","; otherwise use "others"
pdf <- pdf |>
  mutate(activity = ifelse(grepl("ACTIVITY", product_1),
                           sub(".*-(.*?)[(,].*", "\\1", product_1),
                           "others"))

# Each step below only relabels rows that are still "others",
# based on keywords found in the "narrative" column.

# "fainted": syncope, dizziness or weakness
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("SYNCOPAL|DIZZY|WEAK|WEAKNESS|SYNCOPE", narrative), "fainted", activity))

# "WALKING"
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("WALKING|WALK", narrative), "WALKING", activity))

# "STANDING"
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("STANDING|STAND", narrative), "STANDING", activity))

# "SITTING"
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("SITTING|SIT", narrative), "SITTING", activity))

# "Stair Navigation": flights of stairs
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("FLIGHT|STAIRS", narrative), "Stair Navigation", activity))

# "RISING": getting up from a chair or bed
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("GETTING|CHAIR|BED|STOOD UP", narrative), "RISING", activity))

# "SLIPPED"
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("SLIPPED|SLIP", narrative), "SLIPPED", activity))

# "TRIPPED"
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("TRIPPED|TRIP", narrative), "TRIPPED", activity))

# "BENDING": bending over or picking something up
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("BENDING|BENT|BEND|PICK UP|PICKING UP", narrative), "BENDING", activity))

# "MECHANICAL": mechanical falls
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("MECH|MECHANICAL", narrative), "MECHANICAL", activity))

# "LOST BALANCE"
pdf <- pdf |>
  mutate(activity = ifelse(activity == "others" & grepl("LOST BALANCE", narrative), "LOST BALANCE", activity))

# Collapse every sport-like value of "activity" (not "narrative") into "SPORTS"
pdf <- pdf |>
  mutate(activity = ifelse(grepl("BASKETBALL|BASEBALL|BALL|SPORTS|SPORT|BILLIARDS|BOWLING|SKATING|GOLF|TENNIS|MOUNTAIN CLIMBING|SKIING|SOCCER|HOCKEY|FISHING|SWIMMING|MARTIAL ARTS|LACROSSE|TUBING|HORSEBACK RIDING|SURFING|WRESTLING|BADMINTON|SHUFFLEBOARD|FENCING", activity), "SPORTS", activity))
Cleaning Narratives by replacing medical & other abbreviations
Code
# Define the medical_terms dictionary
medical_terms <- list(
  "&" = "and", "***" = "", ">>" = "clinical diagnosis", "@" = "at",
  "abd" = "abdomen", "af" = "accidental fall", "afib" = "atrial fibrillation",
  "aki" = "acute kidney injury", "am" = "morning", "ams" = "altered mental status",
  "bac" = "blood alcohol content", "bal" = "blood alcohol level,",
  "biba" = "brought in by ambulance", "c/o" = "complains of",
  "chi" = "closed-head injury", "clsd" = "closed", "cpk" = "creatine phosphokinase",
  "cva" = "cerebral vascular accident", "dx" = "diagnosis",
  "ecf" = "extended-care facility", "er" = "emergency room", "etoh" = "ethyl alcohol",
  "eval" = "evaluation", "fib" = "fibrillation", "fd" = "fall detected",
  "fx" = "fracture", "fxs" = "fractures", "glf" = "ground level fall",
  "h/o" = "history of", "htn" = "hypertension", "hx" = "history of", "inj" = "injury",
  "inr" = "international normalized ratio", "intox" = "intoxication", "l" = "left",
  "lac" = "laceration", "loc" = "loss of consciousness", "lt" = "left",
  "mech" = "mechanical", "mult" = "multiple", "n.h." = "nursing home",
  "nh" = "nursing home", "p/w" = "presents with", "pm" = "afternoon",
  "pt" = "patient", "pta" = "prior to arrival", "pts" = "patient's",
  "px" = "physical examination", "r" = "right", "r/o" = "rules out", "rt" = "right",
  "s'd&f" = "slipped and fell", "s/p" = "after", "sah" = "subarachnoid hemorrhage",
  "sdh" = "acute subdural hematoma", "sts" = "sit-to-stand",
  "t'd&f" = "tripped and fell", "tr" = "trauma", "uti" = "urinary tract infection",
  "w/" = "with", "w/o" = "without", "wks" = "weeks"
)

# Define the clean_narrative function
clean_narrative <- function(text) {
  # Convert text to lowercase
  text <- tolower(text)

  # Define regex pattern for DX
  # NOTE: in the test output below, the punctuation around "dx" is not consumed,
  # which leaves ". . Diagnosis: :" in the result.
  regex_dx <- "([\\W]*(dx)[\\W]*)"
  text <- gsub(regex_dx, ". dx: ", text)

  # Define regex pattern for age and sex
  regex_age_sex <- "(\\d+)\\s*?(yof|yf|yo\\s*female|yo\\s*f|yom|ym|yo\\s*male|yo\\s*m)"
  age_sex_match <- regexpr(regex_age_sex, text)

  # Format age and sex
  # NOTE: age_sex_match holds a match position, not a pattern, and regmatches()
  # returns only the full match, so this block leaves the text unchanged
  # (see "45 yof" in the test output below).
  if (age_sex_match > 0) {
    age <- regmatches(text, age_sex_match)[[1]][1]
    sex <- regmatches(text, age_sex_match)[[1]][2]
    if ("f" %in% sex) {
      text <- gsub(age_sex_match, "patient", text)
    } else if ("m" %in% sex) {
      text <- gsub(age_sex_match, "patient", text)
    }
  }

  # Translate medical terms
  for (term in names(medical_terms)) {
    if (term %in% c("@", ">>", "&", "***")) {
      # Symbols: match anywhere and pad the replacement with spaces
      pattern <- paste0("(", gsub("[*]", "[*]", term), ")")
      text <- gsub(pattern, paste0(" ", medical_terms[[term]], " "), text)
    } else {
      # Abbreviations: match whole words only
      pattern <- paste0("\\b(", gsub("[*]", "[*]", term), ")\\b")
      text <- gsub(pattern, medical_terms[[term]], text)
    }
  }

  # Capitalize sentences
  text <- gsub("(^|\\.[[:space:]]+)([a-z])", "\\1\\U\\2", text, perl = TRUE)

  # Convert text to uppercase
  # text <- toupper(text)
  return(text)
}

# Test the function
input_text <- "The pt is a 45 yof who c/o abdominal pain. Dx: uti. She fell and has a left hip fx."
cleaned_text <- clean_narrative(input_text)
cat(cleaned_text)
The patient is a 45 yof who complains of abdominal pain. . Diagnosis: : urinary tract infection. She fell and has a left hip fracture.
Code
################################ speed notebook rendering ################################
## run once or alternatively load "data/clean_narrative_data.csv"
## applying cleaning function to data
# pdf$narrative_orig = pdf$narrative
# pdf_0 <- pdf %>%
#   mutate(narrative = map_chr(narrative, clean_narrative))
Code
################################ speed notebook rendering ################################
## speed up rendering by saving the cleaned data to a CSV file and loading it back
# fwrite(pdf_0, "data/clean_narrative_data.csv")
In this section, text processing and text analysis tasks were performed on the cleaned narrative column. The code takes the text data, removes a set of specified words [2] and stop words, tokenizes the text into bigrams, counts the frequency of each bigram, and calculates each bigram's percentage of occurrence, applying several cleaning and filtering operations along the way.
[2] Recurring words that do not provide any insightful information, e.g. “yom”, “yof”.
Note
Code
# https://paldhous.github.io/NICAR/2019/r-text-analysis.html
pdf5 = pdf_0 %>%
  # filter(activity == "Stair Navigation") |>
  # Drop recurring tokens that carry no insight (matching is case-insensitive)
  mutate(narrative = gsub("\\b(yof|yom|pt|dx|diagnosis)\\b", "", narrative, ignore.case = TRUE)) %>%
  unnest_tokens(word, narrative, token = "ngrams", n = 2) %>%  # one bigram per row
  anti_join(stop_words) %>%  # drop rows that match a stop-word entry
  count(word, sort = TRUE)
Code
# Remove bigrams in which either word is a stop word, keep only alphabetic
# words of three or more letters, then compute each bigram's share of the total
pdf6 <- pdf5 %>%
  separate(word, into = c("first", "second"), sep = " ", remove = FALSE) %>%
  anti_join(stop_words, by = c("first" = "word")) %>%
  anti_join(stop_words, by = c("second" = "word")) %>%
  filter(str_detect(first, "^[a-zA-Z]{3,}$") & str_detect(second, "^[a-zA-Z]{3,}$")) %>%
  mutate(percentage = n / sum(n) * 100)
“Head Injury” is the most frequently occurring pair of words in the narrative data.
Text network analysis can be used to represent the narratives as a network graph. The words are the nodes and their co-occurrences are the relations. With the narratives encoded as a network, graph theory algorithms can be used to detect the most influential keywords, identify the main topics and the relations between them, and gain insight into the structure of the discourse. This approach focuses on the relations between words while retaining contextual information and the narrative. Unlike bag-of-words, LDA-based, or Word2Vec models, which may lose information about word sequence, a text network can be built in a way that retains the narrative and therefore provides more accurate information about the text and its topical structure.
This section builds on the previous one by leveraging TextRank, which is based on the PageRank algorithm, to extract sentences (i.e. extractive text summarization). In this analysis, sentences are modelled as the vertices and shared words as the connecting edges, so sentences with words that appear in many other sentences are seen as more important.
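As a rough illustration of the idea that words are nodes and co-occurrences are relations, the base-R sketch below builds a word co-occurrence matrix from three toy narratives and ranks words by weighted degree. The narratives and all variable names are invented for this example, and weighted degree is only one simple way to measure influence; this is not the network construction used in the analysis.

```r
# Toy narratives (invented for illustration; not taken from the data set)
narratives <- c(
  "patient fell at nursing home and sustained hip fracture",
  "patient tripped at home and sustained head injury",
  "patient fell at home and sustained head injury"
)

# Tokenize each narrative into its set of unique lowercase words
tokens <- lapply(strsplit(tolower(narratives), "[^a-z]+"), unique)
vocab <- sort(unique(unlist(tokens)))

# Incidence matrix: one row per narrative, 1 if the word appears in it
incidence <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
colnames(incidence) <- vocab

# Word-by-word co-occurrence: number of narratives containing both words
cooccurrence <- crossprod(incidence)
diag(cooccurrence) <- 0  # ignore self-links

# Weighted degree: how strongly each word (node) is connected to the others
weighted_degree <- sort(rowSums(cooccurrence), decreasing = TRUE)
head(weighted_degree, 5)
```

The resulting matrix can be read as a weighted adjacency matrix, so it could be handed to a graph library for centrality or community detection.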
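The sentence-ranking idea can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce the results in this report: the toy sentences are invented, the edge weight is the word-overlap similarity from the original TextRank paper, and the scores come from a normalized PageRank computed by power iteration with a damping factor of 0.85.

```r
# Toy sentences (invented for illustration)
sentences <- c(
  "fell to the floor at the nursing home and sustained a hip fracture",
  "fell to the floor at home and sustained a laceration to face",
  "tripped on a rug and fell at the nursing home",
  "was seen for chest pain"
)

tokens <- lapply(strsplit(tolower(sentences), "[^a-z]+"), unique)
n <- length(sentences)

# Edge weight between two sentences: number of shared words,
# normalized by the log-lengths of both sentences
similarity <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {
      shared <- length(intersect(tokens[[i]], tokens[[j]]))
      similarity[i, j] <- shared /
        (log(length(tokens[[i]])) + log(length(tokens[[j]])))
    }
  }
}

# Weighted PageRank by power iteration
damping <- 0.85
out_weight <- rowSums(similarity)
out_weight[out_weight == 0] <- 1       # isolated sentences: avoid dividing by zero
transition <- similarity / out_weight  # row i is divided by out_weight[i]
scores <- rep(1 / n, n)
for (iter in 1:100) {
  scores <- (1 - damping) / n + damping * as.vector(t(transition) %*% scores)
}

# Sentences ordered from most to least central
ranking <- order(scores, decreasing = TRUE)
sentences[ranking]
```

The last toy sentence shares no words with the others, so it receives only the baseline score and is ranked last, which mirrors how sentences with little overlap are treated as less representative.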
An Overview of the top 5 sentences based on the first 50 narratives for Adults older than 80 years
Code
# Extract the top 5 sentences by TextRank score
article_summary[["sentences"]] %>%
  arrange(desc(textrank)) %>%
  slice(1:5) %>%
  pull(sentence)
[1] "94 fell to the floor at the nursing home onto back of head sustained a subdural hematoma"
[2] "88 fell to the floor at the nursing home and sustained a laceration to face"
[3] "93 was walking at the nursing home and tripped and fell to the floor onto head sustained a closed head injury"
[4] "92 fell to carpeted floor at the nursing home and sustained a hip fracture"
[5] "95 fell to the floor at home and sustained a hip fracture"
Key takeaway(s)
Among adults older than 85 years, falls tend to occur at the nursing home: four of the five top-ranked sentences above describe a nursing-home fall.
Modelling: Identifying themes based on narrative embedding
In this section, two clustering algorithms, k-means and DBSCAN, were tested to compare how well they identify theme clusters.
K-means clustering is one of the most commonly used unsupervised machine learning algorithms for partitioning a given data set into a set of k groups (i.e. k clusters), where k is the number of groups pre-specified by the analyst. The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.
DBSCAN is a density-based clustering algorithm, which can be used to identify clusters of any shape in a data set containing noise and outliers. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. The goal is to identify dense regions, which can be measured by the number of objects close to a given point.
Processing step: UMAP reduction is applied to the PCA-processed data to represent the data in 2-dimensional space.
# Specify the clustering model; the number of clusters is arbitrarily set to 4
kmeans_spec_best_emb <- k_means(num_clusters = 4) %>%
  set_engine("ClusterR")

# Create workflow
kmeans_wf_best_emb <- workflow() |>
  add_recipe(recipe_object_2) |>
  add_model(kmeans_spec_best_emb)

# Fit model
kmeans_best_fit_mdl_emb <- kmeans_wf_best_emb |>
  fit(data = emb2d)  # emb2d

kmeans_best_fit_mdl_emb
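To make the contrast concrete, the base-R sketch below runs k-means on invented 2-D data and then applies only the density test at the heart of DBSCAN (counting neighbours within a radius). It is not a full DBSCAN implementation and not the pipeline used in this analysis; the data, `eps`, and `min_pts` values are assumptions chosen for illustration, and the point itself is counted as one of its own neighbours.

```r
set.seed(42)  # k-means starts from random centers

# Toy 2-D data (invented): two tight groups plus one isolated outlier
points <- rbind(
  cbind(rnorm(20, mean = 0, sd = 0.3), rnorm(20, mean = 0, sd = 0.3)),
  cbind(rnorm(20, mean = 5, sd = 0.3), rnorm(20, mean = 5, sd = 0.3)),
  c(12, 12)
)

# k-means: k is fixed up front, and every point (outlier included) is assigned
km <- kmeans(points, centers = 2, nstart = 10)
km$tot.withinss  # total within-cluster variation, the quantity k-means minimizes

# DBSCAN's density test: a point is a "core" point if at least min_pts points
# (itself included) lie within radius eps
eps <- 1
min_pts <- 5
pairwise <- as.matrix(dist(points))
neighbour_count <- rowSums(pairwise <= eps)
is_core <- neighbour_count >= min_pts

is_core[41]  # the outlier has no dense neighbourhood
```

Because the outlier fails the density test, a density-based method can label it as noise, whereas k-means has to place it in one of the k clusters.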
# Extract the top 10 sentences by TextRank score
article_summary_1[["sentences"]] %>%
  arrange(desc(textrank)) %>%
  slice(1:10) %>%
  pull(sentence)
[1] "65 was intoxicated blood alcohol content 151 and fell onto her head on the floor . closed head injury; acute alcohol intoxication"
[2] "65 fell and struck head on wooden floor while intoxicated with no blood alcohol level, . head injury, alcohol intoxication"
[3] "76 was intoxicated blood alcohol content 208 and fell to the bathroom floor onto head . closed head injury; acute alcohol intoxication"
[4] "74 was intoxicated and fell down a flight of stairs blood alcohol content of 198 . closed head injury acute ethyl alcohol intoxication"
[5] "74 ground level fall after drinking, fell onto table . scalp laceration, ethyl alcohol abuse no blood alcohol content"
[6] "75 drinking ethyl alcohol at a bar, passed out and fell to floor hitting face blood alcohol content not done. . facial fractures, closed head injury, acute ethyl alcohol intoxication"
[7] "80 per report patient presents after drinking ethyl alcohol tonight and fell from bed and not acting like himself blood alcohol level, 302 . fall ethyl alcohol intoxication forehead laceration"
[8] "77 was at home intoxicated blood alcohol content 205 and fell to the floor onto her head . subarachnoid hemorrhage; facial laceration; alcohol intoxication"
[9] "65 drinking alchol and fell down stairs hitting forehead with no blood alcohol level, done . laceration forehead, alcohol intoxication"
[10] "84 fell backwards and struck head on floor+ethyl alcohol,blood alcohol content>292--. laceration scalp"
Summary overview of all cluster themes
| Cluster | Theme | Associated Activities | Obstacle | Injury | Top 3 Keywords (excluding the term “Fall”) |
|---|---|---|---|---|---|
| 0 | General Elderly Falls and Injuries | Lost Balance | Ladders, others not specified | Others | left, admit, contusion |
| 1 | Head Injuries from Falls | Standing, Rising | Bed or bed-frames | Laceration | injury, laceration, contusion |
| 2 | Falls Resulting in Shoulder Injuries | Tripped, Exercise, Sports | Exercise | Dislocation, Avulsion, Strain & Sprain | fracture, left, humerus |
| 3 | Hip Injuries from Falls | Rising, Tripped | Footwear | Fracture, Strain & Sprain | fracture, left, femur |
| 4 | Syncope-Related Head Injuries | Fainted | Toilets | Laceration | syncope, laceration, striking |
| 5 | Rib Injuries from Falls | Standing | Bath-tubs or Showers | Fracture | left, fracture, ribs |
| 6 | Alcohol-Related Head Injuries and Falls | Stair Navigation, others | Stairs or steps | Poisoning, Laceration | alcohol, blood, intoxication |
| 7 | Buttocks Contusions from Falls | Rising, Sitting, Slipped | Bed or bed-frames | Contusions | contusions, buttocks, lower |
| 8 | Atrial fibrillation related falls | Sitting, Standing | Tables, rugs & carpets, Ceilings & Walls | Hematoma | encounter, laceration, initial |
| 9 | Floor Falls and Associated Injuries | Walking, Slipped | Floors, balconies | Contusions | falling, floor, dizzy |
Further Exploration
In this section, the cluster themes are explored further in relation to other variables such as age, severity level (see Appendix), and sex.
# Cluster themes by age distribution
plot_2 = ggplot(pdf_cluster_emb_merge_db) +
  aes(x = db_cluster, y = age) +
  geom_boxplot(fill = "#AEC8DF") +
  labs(x = "Cluster themes", title = "Cluster themes by age distribution") +
  theme_minimal()
ggplotly(plot_2)
Code
# Cluster themes by severity level
plot_3 = ggplot(pdf_cluster_emb_merge_db) +
  aes(x = db_cluster, fill = severity_level) +
  geom_bar() +
  scale_fill_brewer(palette = "Blues", direction = 1) +
  labs(x = "Cluster themes", y = "Number of narrative",
       title = "Cluster themes by Severity levels", caption = "..",
       fill = "Severity Level") +
  theme_minimal()
ggplotly(plot_3)
Code
# Cluster themes by sex
plot_4 = ggplot(pdf_cluster_emb_merge_db) +
  aes(x = db_cluster, fill = sex) +
  geom_bar() +
  scale_fill_brewer(palette = "Blues", direction = 1) +
  labs(x = "Cluster Themes", y = "Number of Narratives",
       title = "Cluster themes by sex") +
  theme_minimal()
ggplotly(plot_4)
Code
# Cluster themes by incident location, excluding unknown locations
plot_5 = pdf_cluster_emb_merge_db %>%
  filter(!(location %in% "UNK")) %>%
  ggplot() +
  aes(x = db_cluster, fill = location) +
  geom_bar() +
  scale_fill_brewer(palette = "Blues", direction = 1) +
  labs(x = "Cluster Theme", y = "Number of Narratives",
       title = "Cluster themes by incident location") +
  theme_minimal()
ggplotly(plot_5)
Code
# Number of cases per treatment date and cluster theme
plot_6 = pdf_cluster_emb_merge_db |>
  group_by(treatment_date, db_cluster) |>
  summarise(cases = n())

# Trend of cluster themes over time
plot_6a = ggplot(plot_6) +
  aes(x = treatment_date, y = cases, colour = db_cluster) +
  geom_line() +
  scale_color_hue(direction = 1) +
  labs(y = "Cases", title = "Trend of Cluster Themes", color = "Cluster themes") +
  theme_minimal()
ggplotly(plot_6a)
Summary table view of the average number of cases for each cluster theme across the different years
Combining embeddings with dimensionality reduction techniques has proven to be highly effective in the extraction of cluster themes.
DBSCAN outperforms k-means in cluster identification.
Patients in the “Alcohol-Related Head Injuries and Falls” group tended to be younger, while the “Atrial fibrillation related falls” group was generally older.
The “Syncope-Related Head Injuries” group had a higher rate of severe cases than the other groups.
Compared with the previous year (2021), cases involving “Head Injuries from Falls”, “Syncope-Related Head Injuries” and “Rib Injuries from Falls” saw the largest increases in the average number of cases.
Appendix
Disposition classification from which the severity levels were derived

| Disposition Code | Category |
|---|---|
| 1 | Not Severe |
| 2 | Not Severe |
| 4 | Severe |
| 5 | Severe |
| 6 | Not Severe |
| 8 | Severe |
| 9 | Not Severe |
In this classification:
“Severe” includes disposition codes 4 (Treated and admitted for hospitalization), 5 (Held for observation), and 8 (Fatality, including DOA and deaths in the ED or after admission).
“Not Severe” includes disposition codes 1 (Treated and released, or examined and released without treatment, or transfers for treatment to another department of the same facility without admission), 2 (Treated and transferred to another hospital), 6 (Left without being seen, Left against medical advice, Left without treatment, Eloped), and 9 (Not recorded).
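The classification above can be expressed as a small lookup. The sketch below is illustrative only: `classify_severity` and `severe_codes` are hypothetical names, and it assumes the disposition is available as a numeric code rather than as the recoded text label used in the main pipeline.

```r
# Disposition codes listed as "Severe" in the classification above
severe_codes <- c(4, 5, 8)

# Hypothetical helper: map numeric disposition codes to a severity category
classify_severity <- function(disposition_code) {
  ifelse(disposition_code %in% severe_codes, "Severe", "Not Severe")
}

classify_severity(c(1, 2, 4, 5, 6, 8, 9))
```

Keeping the severe codes in one vector makes the rule easy to audit against the table and to change in a single place.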